NSF PAR Search | NSF Public Access Repository

ACES: Accelerating Sparse Matrix Multiplication with Adaptive Execution Flow and Concurrency-Aware Cache Optimizations

https://doi.org/10.1145/3620666.3651381

Lu, Xiaoyang; Long, Boyu; Chen, Xiaoming; Han, Yinhe; Sun, Xian-He (April 2024, ACM)

Sparse matrix-matrix multiplication (SpMM) is a critical computational kernel in numerous scientific and machine learning applications. SpMM involves massive irregular memory accesses and poses great challenges to conventional cache-based computer architectures. Recently, dedicated SpMM accelerators have been proposed to enhance SpMM performance. However, current SpMM accelerators still face challenges in adapting to varied sparse patterns, fully exploiting inherent parallelism, and optimizing cache performance. To address these issues, this study introduces ACES, a novel SpMM accelerator. First, ACES features an adaptive execution flow that dynamically adjusts to diverse sparse patterns and balances parallel computing efficiency against data reuse. Second, ACES incorporates locality-concurrency co-optimizations within the global cache: a concurrency-aware cache management policy considers both data locality and concurrency when making replacement decisions, and a non-blocking buffer integrated with the global cache enhances concurrency and reduces computational stalls. Third, the hardware architecture of ACES is designed to integrate all of these innovations, efficiently supporting the adaptive execution flow, the cache optimizations, and fine-grained parallel processing. Our performance evaluation demonstrates that ACES significantly outperforms existing solutions, providing a 2.1× speedup.
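
To make the irregular access pattern of SpMM concrete, here is a minimal software sketch of a row-wise (Gustavson-style) product of two sparse matrices stored in CSR form. It is not the ACES execution flow, which the abstract does not specify; the names `to_csr` and `spmm_rowwise` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def to_csr(dense):
    """Convert a dense list-of-lists matrix into CSR arrays (indptr, indices, data)."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                indices.append(col)
                data.append(value)
        indptr.append(len(indices))
    return indptr, indices, data


def spmm_rowwise(a, b):
    """Multiply two CSR matrices row by row.

    Returns one {column: value} dict per row of the result.
    Illustrative only: a hypothetical helper, not an accelerator model.
    """
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    result = []
    for i in range(len(a_ptr) - 1):
        acc = defaultdict(float)
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k = a_idx[p]
            # Which row of B is read depends on the nonzero pattern of A,
            # which is the source of the irregular memory accesses.
            for q in range(b_ptr[k], b_ptr[k + 1]):
                acc[b_idx[q]] += a_val[p] * b_val[q]
        result.append(dict(acc))
    return result


A = to_csr([[1, 0, 2], [0, 0, 3]])
B = to_csr([[0, 4], [5, 0], [0, 6]])
product = spmm_rowwise(A, B)
```

In this sketch the rows of `B` that get fetched are determined entirely by the column indices stored in `A`, so the reuse of any given `B` row varies with the sparsity pattern.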
Full Text Available
CHROME: Concurrency-Aware Holistic Cache Management Framework with Online Reinforcement Learning

https://doi.org/10.1109/HPCA57654.2024.00090

Lu, Xiaoyang; Najafi, Hamed; Liu, Jason; Sun, Xian-He (March 2024, 30th IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Cache management is a critical aspect of computer architecture, encompassing techniques such as cache replacement, bypassing, and prefetching. Existing research has often focused on individual techniques, overlooking the potential benefits of joint optimization. Moreover, many of these approaches rely on static and intuition-driven policies, limiting their performance under complex and dynamic workloads. To address these challenges, this paper introduces CHROME, a novel concurrency-aware cache management framework. CHROME takes a holistic approach by seamlessly integrating intelligent cache replacement and bypassing with pattern-based prefetching. By leveraging online reinforcement learning, CHROME dynamically adapts cache decisions based on multiple program features and applies a reward for each decision that considers the accuracy of the action and the system-level feedback information. Our performance evaluation demonstrates that CHROME outperforms current state-of-the-art schemes, exhibiting significant improvements in cache management. Notably, CHROME achieves a performance boost of up to 13.7% over the traditional LRU method in multi-core systems with only modest overhead.
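
For readers unfamiliar with the LRU baseline mentioned above, the following is a minimal sketch of a least-recently-used replacement policy that counts hits over an access trace. It illustrates only the static baseline, not CHROME's reinforcement-learning policy; the function name `lru_hits` and the trace format are assumptions made for this example.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count cache hits for an access trace under LRU replacement.

    `trace` is any iterable of hashable addresses (a hypothetical format).
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache = OrderedDict()  # ordered from least to most recently used
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[addr] = True
    return hits


example_hits = lru_hits([1, 2, 3, 1, 4, 2], capacity=3)
```

A static policy like this always evicts by recency alone, which is the kind of fixed, intuition-driven rule the abstract contrasts with learned decisions.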
Full Text Available
An Evaluation of DAOS for Simulation and Deep Learning HPC Workloads

https://doi.org/10.1145/3578353.3589542

Logan, Luke; Lofstead, Jay; Sun, Xian-He; Kougkas, Anthony (May 2023, In Proceedings of the 3rd Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems)
The Memory-Bounded Speedup Model and Its Impacts in Computing

https://doi.org/10.1007/s11390-022-2911-1

Sun, Xian-He; Lu, Xiaoyang (February 2023, Journal of Computer Science and Technology)

Full Text Available
CARE: A Concurrency-Aware Enhanced Lightweight Cache Management Framework

https://doi.org/10.1109/HPCA56546.2023.10071125

Lu, Xiaoyang; Wang, Rujia; Sun, Xian-He (February 2023, 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Full Text Available
PMAlloc: A Holistic Approach to Improving Persistent Memory Allocation

https://doi.org/10.1145/3643886

Dang, Zheng; He, Shuibing; Zhang, Xuechen; Hong, Peiyi; Li, Zhenxin; Chen, Xinyu; Song, Haozhe; Sun, Xian-He; Chen, Gang (November 2024, ACM Transactions on Computer Systems)

Persistent memory allocation is a fundamental building block for developing high-performance, in-memory applications. Existing persistent memory allocators suffer from several performance issues. First, they may introduce repeated cache line flushes and small random accesses in persistent memory because of their poor heap metadata management. Second, they use static slab segregation, resulting in a dramatic increase in memory consumption when the allocation request size changes. Third, they are not aware of NUMA effects, leading to remote persistent memory accesses during memory allocation and deallocation. In this article, we design a novel allocator, named PMAlloc, to solve the above issues simultaneously. (1) PMAlloc eliminates cache line reflushes by mapping contiguous data blocks in slabs to interleaved metadata entries stored in different cache lines. (2) It writes small metadata units to a persistent bookkeeping log in a sequential pattern to remove random heap metadata accesses in persistent memory. (3) Instead of using static slab segregation, it supports slab morphing, which allows slabs to be transformed between size classes to significantly improve slab usage. (4) It uses a local-first allocation policy to avoid allocating remote memory blocks. It also supports a two-phase deallocation mechanism, consisting of recording and synchronization, to minimize the number of remote memory accesses during deallocation. PMAlloc is complementary to the existing consistency models. Results on six benchmarks demonstrate that PMAlloc improves the performance of state-of-the-art persistent memory allocators by up to 6.4× and 57× for small and large allocations, respectively. PMAlloc with NUMA optimizations brings a 2.9× speedup in multi-socket evaluation and is up to 36× faster than other persistent memory allocators. Using PMAlloc reduces memory usage by up to 57.8%. In addition, we integrate PMAlloc in a persistent FPTree. Compared to the state-of-the-art allocators, PMAlloc improves the performance of this application by up to 3.1×.
Full Text Available
A Generalized Model for Modern Hierarchical Memory System

https://doi.org/10.1109/WSC57314.2022.10015298

Najafi, Hamed; Liu, Jason; Lu, Xiaoyang; Sun, Xian-He (December 2022, 2022 Winter Simulation Conference (WSC))

The memory system is critical to architecture design and can significantly impact application performance. Concurrent Average Memory Access Time (C-AMAT) is a model for analyzing and optimizing memory system performance using a recursive definition of the memory access latency along the memory hierarchy. The original C-AMAT model, however, does not provide the necessary granularity and flexibility for handling modern memory architectures with heterogeneous memory technologies and diverse system topologies. We propose to augment C-AMAT to take into consideration the idiosyncrasies of individual cache/memory components as well as their topological arrangement in the memory architecture design. Through trace-based simulation, we validate the augmented model and examine the memory system performance with insight unavailable using the original C-AMAT model.
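
As a rough illustration of what a recursive latency definition along a memory hierarchy looks like, the sketch below evaluates the classic, non-concurrent AMAT recurrence (hit time plus miss rate times the next level's access time). This is the textbook AMAT model, not C-AMAT or the augmented model proposed in the paper, neither of which is spelled out in the abstract; the function name `amat` and the `(hit_time, miss_rate)` tuple layout are assumptions for this example.

```python
def amat(levels, memory_latency):
    """Classic average memory access time, defined recursively.

    `levels` lists (hit_time, miss_rate) pairs from the level closest to the
    core outward; `memory_latency` is the cost of reaching main memory.
    Hypothetical helper for illustration only.
    """
    if not levels:
        return memory_latency
    hit_time, miss_rate = levels[0]
    # A miss at this level pays the average access time of the next level.
    return hit_time + miss_rate * amat(levels[1:], memory_latency)


# Two cache levels in front of a 100-cycle memory (made-up numbers).
two_level = amat([(1, 0.1), (10, 0.5)], memory_latency=100)
```

The recursion assumes a strictly linear hierarchy with one component per level, which is the kind of restriction a model for heterogeneous memories and diverse topologies would need to relax.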
Full Text Available
LabStor: A Modular and Extensible Platform for Developing High-Performance, Customized I/O Stacks in Userspace

https://doi.org/10.1109/sc41404.2022.00028

Logan, Luke; Garcia, Jaime Cernuda; Lofstead, Jay; Sun, Xian-He; Kougkas, Anthony (November 2022, IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC22))
iCache: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training

https://doi.org/10.1109/HPCA56546.2023.10070964

Chen, Weijian; He, Shuibing; Xu, Yaowen; Zhang, Xuechen; Yang, Siling; Hu, Shuang; Sun, Xian-He; Chen, Gang (February 2023, the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29))
Accelerating Tensor Swapping in GPUs With Self-Tuning Compression

https://doi.org/10.1109/TPDS.2022.3193867

Chen, Ping; He, Shuibing; Zhang, Xuechen; Chen, Shuaiben; Hong, Peiyi; Yin, Yanlong; Sun, Xian-He (December 2022, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
